labeled and unlabeled data
InterLUDE: Interactions between Labeled and Unlabeled Data to Enhance Semi-Supervised Learning
Huang, Zhe, Yu, Xiaowei, Zhu, Dajiang, Hughes, Michael C.
Semi-supervised learning (SSL) seeks to enhance task performance by training on both labeled and unlabeled data. Mainstream SSL image classification methods mostly optimize a loss that additively combines a supervised classification objective with a regularization term derived solely from unlabeled data. This formulation neglects the potential for interaction between labeled and unlabeled images. In this paper, we introduce InterLUDE, a new approach to enhance SSL made of two parts that each benefit from labeled-unlabeled interaction. The first part, embedding fusion, interpolates between labeled and unlabeled embeddings to improve representation learning. The second part is a new loss, grounded in the principle of consistency regularization, that aims to minimize discrepancies in the model's predictions between labeled versus unlabeled inputs. Experiments on standard closed-set SSL benchmarks and a medical SSL task with an uncurated unlabeled set show clear benefits to our approach. On the STL-10 dataset with only 40 labels, InterLUDE achieves 3.2% error rate, while the best previous method reports 14.9%.
Probabilistic Modeling for Face Orientation Discrimination: Learning from Labeled and Unlabeled Data
This paper presents probabilistic modeling methods to solve the problem of dis(cid:173) criminating between five facial orientations with very little labeled data. The first model maintains no inter-pixel dependencies, the second model is capable of modeling a set of arbitrary pair-wise dependencies, and the last model allows dependencies only between neighboring pixels. We show that for all three of these models, the accuracy of the learned models can be greatly improved by augmenting a small number of labeled training images with a large set of unlabeled images using Expectation-Maximization. This is important because it is often difficult to obtain image labels, while many unla(cid:173) beled images are readily available. Through a large set of empirical tests, we examine the benefits of unlabeled data for each of the models.
Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation
Chen, Mayee F., Cohen-Wang, Benjamin, Mussmann, Stephen, Sala, Frederic, Rรฉ, Christopher
Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios -- well-specified, misspecified, and corrected models -- to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe theoretically and with synthetic experiments that for well-specified models, labeled points are worth a constant factor more than unlabeled points. With misspecification, however, their relative value is higher due to the additional bias but can be reduced with correction. We also apply our approach to study real-world weak supervision techniques for dataset construction.
Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains
There has been increased interest in devising learning techniques that combine unlabeled data with labeled data - i.e. semi-supervised learning. However, to the best of our knowledge, no study has been performed across various techniques and different types and amounts of labeled and unlabeled data. Moreover, most of the published work on semi-supervised learning techniques assumes that the labeled and unlabeled data come from the same distribution. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets are different. Not correcting for such bias can result in biased function approximation with potentially poor performance. In this paper, we present an empirical study of various semi-supervised learning techniques on a variety of datasets. We attempt to answer various questions such as the effect of independence or relevance amongst features, the effect of the size of the labeled and unlabeled sets and the effect of noise. We also investigate the impact of sample-selection bias on the semi -supervised learning techniques under study and implement a bivariate probit technique particularly designed to correct for such bias.